Syntax Analysis using Amazon Comprehend Syntax API with AWS SDK for Python (Boto3)
Amazon Comprehend announced support for Syntax Analysis. In this blog, let's perform syntax analysis using the Amazon Comprehend Syntax API with the AWS SDK for Python (Boto3).
Amazon Comprehend Now Supports Syntax Analysis
Environment
```
$ pip list | grep boto3
boto3    1.9.2
```
Sample Code
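A minimal sketch that calls the `detect_syntax` API on the sentence analyzed in the execution result below might look like the following (the region name and the exact output formatting are assumptions):

```python
import json

import boto3

# Create a Comprehend client (the region here is an assumption; use your own).
comprehend = boto3.client('comprehend', region_name='us-east-1')

# The sentence analyzed in the execution result below.
text = ('Amazon Comprehend is a natural language processing (NLP) service '
        'that uses machine learning to find insights and relationships in text.')

# Call the Syntax API; LanguageCode is required (English in this example).
response = comprehend.detect_syntax(Text=text, LanguageCode='en')

# Pretty-print the full response, including ResponseMetadata.
print(json.dumps(response, indent=2, sort_keys=True))
```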
Execution result
{ "ResponseMetadata": { "HTTPHeaders": { "connection": "keep-alive", "content-length": "2758", "content-type": "application/x-amz-json-1.1", "date": "Wed, 12 Sep 2018 16:35:03 GMT", "x-amzn-requestid": "cc8b7643-b6a9-11e8-9f8f-71568a3ae70c" }, "HTTPStatusCode": 200, "RequestId": "cc8b7643-b6a9-11e8-9f8f-71568a3ae70c", "RetryAttempts": 0 }, "SyntaxTokens": [ { "BeginOffset": 0, "EndOffset": 6, "PartOfSpeech": { "Score": 0.9970498085021973, "Tag": "PROPN" }, "Text": "Amazon", "TokenId": 1 }, { "BeginOffset": 7, "EndOffset": 17, "PartOfSpeech": { "Score": 0.9976467490196228, "Tag": "PROPN" }, "Text": "Comprehend", "TokenId": 2 }, { "BeginOffset": 18, "EndOffset": 20, "PartOfSpeech": { "Score": 0.9982584118843079, "Tag": "VERB" }, "Text": "is", "TokenId": 3 }, { "BeginOffset": 21, "EndOffset": 22, "PartOfSpeech": { "Score": 0.9999969005584717, "Tag": "DET" }, "Text": "a", "TokenId": 4 }, { "BeginOffset": 23, "EndOffset": 30, "PartOfSpeech": { "Score": 0.9993355870246887, "Tag": "ADJ" }, "Text": "natural", "TokenId": 5 }, { "BeginOffset": 31, "EndOffset": 39, "PartOfSpeech": { "Score": 0.996455729007721, "Tag": "NOUN" }, "Text": "language", "TokenId": 6 }, { "BeginOffset": 40, "EndOffset": 50, "PartOfSpeech": { "Score": 0.9889174699783325, "Tag": "NOUN" }, "Text": "processing", "TokenId": 7 }, { "BeginOffset": 51, "EndOffset": 52, "PartOfSpeech": { "Score": 0.9999988079071045, "Tag": "PUNCT" }, "Text": "(", "TokenId": 8 }, { "BeginOffset": 52, "EndOffset": 55, "PartOfSpeech": { "Score": 0.9151285290718079, "Tag": "PROPN" }, "Text": "NLP", "TokenId": 9 }, { "BeginOffset": 55, "EndOffset": 56, "PartOfSpeech": { "Score": 0.9999597072601318, "Tag": "PUNCT" }, "Text": ")", "TokenId": 10 }, { "BeginOffset": 57, "EndOffset": 64, "PartOfSpeech": { "Score": 0.9986529350280762, "Tag": "NOUN" }, "Text": "service", "TokenId": 11 }, { "BeginOffset": 65, "EndOffset": 69, "PartOfSpeech": { "Score": 0.9936331510543823, "Tag": "PRON" }, "Text": "that", "TokenId": 12 }, { "BeginOffset": 70, "EndOffset": 74, "PartOfSpeech": { "Score": 0.9999306201934814, "Tag": "VERB" }, "Text": "uses", "TokenId": 13 }, { "BeginOffset": 75, "EndOffset": 82, "PartOfSpeech": { "Score": 0.9979239702224731, "Tag": "NOUN" }, "Text": "machine", "TokenId": 14 }, { "BeginOffset": 83, "EndOffset": 91, "PartOfSpeech": { "Score": 0.7294206023216248, "Tag": "VERB" }, "Text": "learning", "TokenId": 15 }, { "BeginOffset": 92, "EndOffset": 94, "PartOfSpeech": { "Score": 0.9947968125343323, "Tag": "PART" }, "Text": "to", "TokenId": 16 }, { "BeginOffset": 95, "EndOffset": 99, "PartOfSpeech": { "Score": 0.9998737573623657, "Tag": "VERB" }, "Text": "find", "TokenId": 17 }, { "BeginOffset": 100, "EndOffset": 108, "PartOfSpeech": { "Score": 0.9998371601104736, "Tag": "NOUN" }, "Text": "insights", "TokenId": 18 }, { "BeginOffset": 109, "EndOffset": 112, "PartOfSpeech": { "Score": 0.9999772310256958, "Tag": "CONJ" }, "Text": "and", "TokenId": 19 }, { "BeginOffset": 113, "EndOffset": 126, "PartOfSpeech": { "Score": 0.9998776912689209, "Tag": "NOUN" }, "Text": "relationships", "TokenId": 20 }, { "BeginOffset": 127, "EndOffset": 129, "PartOfSpeech": { "Score": 0.9999299049377441, "Tag": "ADP" }, "Text": "in", "TokenId": 21 }, { "BeginOffset": 130, "EndOffset": 134, "PartOfSpeech": { "Score": 0.9992431402206421, "Tag": "NOUN" }, "Text": "text", "TokenId": 22 }, { "BeginOffset": 134, "EndOffset": 135, "PartOfSpeech": { "Score": 0.9999969005584717, "Tag": "PUNCT" }, "Text": ".", "TokenId": 23 } ] }
You can see that the text is tokenized and each token is labeled with a part of speech, for instance, noun or verb. You can also confirm the confidence score for each label.
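For example, a short loop over the SyntaxTokens list (assuming the response object from the sketch above) prints each token together with its tag and score:

```python
# Print each token's text, part-of-speech tag, and confidence score.
for token in response['SyntaxTokens']:
    pos = token['PartOfSpeech']
    print(f"{token['Text']:<15}{pos['Tag']:<8}{pos['Score']:.4f}")
```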
The parts of speech corresponding to each tag are summarized below.
Tag | Part of speech |
---|---|
ADJ | Adjective |
ADP | Adposition |
ADV | Adverb |
AUX | Auxiliary |
CONJ | Coordinating conjunction |
DET | Determiner |
INTJ | Interjection |
NOUN | Noun |
NUM | Numeral |
O | Other |
PART | Particle |
PRON | Pronoun |
PROPN | Proper noun |
PUNCT | Punctuation |
SCONJ | Subordinating conjunction |
SYM | Symbol |
VERB | Verb |
Please refer to this documentation for details.
Conclusion
Amazon Comprehend's Syntax Analysis tokenizes text and labels each token with its part of speech, along with a confidence score for the label.
In this blog, we illustrated syntax analysis using the Amazon Comprehend Syntax API with the AWS SDK for Python (Boto3).
Please refer to the blogs below about the other features of Amazon Comprehend: Keyphrase Extraction, Sentiment Analysis, Entity Recognition, Language Detection, and Topic Modeling.
How to use Amazon Comprehend operations using the AWS SDK for Python (Boto3)